STATS 32: Introduction to R for Undergraduates

Elena Tuzhilina

Oct 7, 2021

http://web.stanford.edu/~elenatuz/courses/stats32-aut2021/

Recap of session 5

ALL of these functions take:

  1. A dataset
  2. Some instructions on what to do with the dataset.

The dataset is either:

  1. Passed to the function through a “pipe” %>%, e.g.
df %>% select(day)
  2. Passed as the first argument within the function’s parentheses, e.g.
select(df, day)

Recap of session 5

ALL of these functions return a dataset!

You can do three things with this returned dataset:

  1. Nothing, in which case it prints to screen.
df %>% select(day)
  2. Save it by assigning it to a variable.
df <- df %>% select(day)
  3. Don’t save it, but pass it on to another function using a “pipe” %>%
df %>% select(day) %>% filter(day < 10)

%>% syntax with dplyr

Take the mtcars dataset, select just the wt and mpg columns, then select rows with mpg < 15

mtcars %>% 
    select(wt, mpg) %>% 
    filter(mpg < 15)
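
The same pipeline can also be written without the pipe, either by nesting the calls or by saving the intermediate dataset. A minimal sketch, assuming the dplyr package is installed; the variable names piped, nested, step1, and stepwise are illustrative only.

```r
library(dplyr)

# Piped version: read top to bottom
piped <- mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)

# Nested version: read inside out
nested <- filter(select(mtcars, wt, mpg), mpg < 15)

# Step-by-step version: save the intermediate dataset first
step1 <- select(mtcars, wt, mpg)
stepwise <- filter(step1, mpg < 15)

# All three approaches should give the same dataset
identical(piped, nested)
identical(piped, stepwise)
```

The piped form tends to be easiest to read once a chain has more than two steps.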

Agenda for today

tidyr package

Another useful package for data manipulation.

Let’s consider a dataset: no. of cases for each country

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

gather() function

How to make a line plot of no. of cases by year for each country?

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Probably want something like

ggplot(df) +
    geom_line(aes(x = year, y = cases, group = country))

Problem: Column names are values of the variable year.

Solution: Reshape dataset using gather() function in tidyr

(Source: R for Data Science)

df %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
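
To run this reshape yourself, the sketch below rebuilds the small table by hand. It assumes the tidyr and tibble packages are installed. Newer tidyr releases (1.0.0 and later) mark gather() as superseded in favour of pivot_longer(); the second call shows what should be the equivalent reshape, although the rows come out in a different order. The names long_gather and long_pivot are illustrative.

```r
library(tidyr)
library(tibble)

# Rebuild the wide table (values copied from the printed output above)
df <- tibble(
    country = c("Afghanistan", "Brazil", "China"),
    `1999` = c(745, 37737, 212258),
    `2000` = c(2666, 80488, 213766)
)

# gather(): the columns `1999` and `2000` become values of a new column "year"
long_gather <- gather(df, `1999`, `2000`, key = "year", value = "cases")

# pivot_longer(): the same idea with the newer interface (tidyr >= 1.0.0)
long_pivot <- pivot_longer(df, cols = c(`1999`, `2000`),
                           names_to = "year", values_to = "cases")

dim(long_gather)   # 6 rows, 3 columns
dim(long_pivot)    # 6 rows, 3 columns
```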

df <- df %>% gather(`1999`, `2000`, key = "year", value = "cases")

ggplot() +
    geom_line(data = df, mapping = aes(x = as.numeric(year), y = cases, col = country))

Functions: R’s workhorse

A function is a named block of code which takes some inputs, performs an action with them, and returns an output.

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")
## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str(), head(), sd(), ggplot(), select(), …

How to use a function that already exists

The most important syntax in R is the function call. All R syntax has function calls underlying it.

A function call consists of:

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)

Function example

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA

mean(x, na.rm = TRUE)
## [1] -1

Applying multiple functions

Function calls read “inside out”!

abs(x): computes absolute value of x.

mean(x): computes the average value for x.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

The pipe operator %>% vs direct function call

library(magrittr)
x %>% abs() %>% mean(na.rm = TRUE)
## [1] 2.6
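
The pipe places whatever is on its left into the first argument of the function on its right, so x %>% f(a) is another way of writing f(x, a). A small check, assuming the magrittr package is installed; the names nested and piped are illustrative.

```r
library(magrittr)

x <- c(-5, -3, -1, 1, 3, NA)

nested <- mean(abs(x), na.rm = TRUE)          # read inside out
piped  <- x %>% abs() %>% mean(na.rm = TRUE)  # read left to right

nested
piped
identical(nested, piped)
```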

A deeper look at functions

Question: How do we find out what a function does? What inputs does it accept, what does it output, etc…

First answer: Google it! Google “R <function name>”

A (probably) better answer: Documentation in R itself!

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

sample(): Description

sample(): Usage

What comes after the = sign: default value for that argument
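
The argument names and default values listed under Usage can also be inspected from the console. A short sketch using base R's args() and formals():

```r
# Print the signature of sample(); defaults appear after the = sign
args(sample)

# Names of the arguments, in the order they are listed
names(formals(sample))

# Default value of the replace argument
formals(sample)$replace
```

For sample(), replace defaults to FALSE and prob defaults to NULL.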

sample(): Arguments

sample(): Details

sample(): Value

How does R know which arguments we are referring to?

sample(x = 1:10, size = 10)
##  [1]  4  2  7  1 10  9  6  8  3  5
sample(1:10, 10, TRUE)
##  [1]  9 10  3  9 10  3  7  3  5  4
sample(1:10, TRUE, size = 5)
## [1] 9 6 3 3 9
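
In these calls R first matches the arguments given by name, then fills the remaining arguments by position, in the order listed on the help page. So in the last call size = 5 is matched by name, 1:10 becomes x, and TRUE becomes replace. The sketch below checks this; set.seed() is used only so that the random draws can be compared, and the names a, b, d are illustrative.

```r
# Same seed before each call, so identical arguments give identical samples
set.seed(32)
a <- sample(x = 1:10, size = 5, replace = TRUE)

set.seed(32)
b <- sample(1:10, 5, TRUE)          # matched purely by position

set.seed(32)
d <- sample(1:10, TRUE, size = 5)   # size by name, the rest by position

identical(a, b)
identical(a, d)

# match.call() shows how R fills in the argument names
match.call(sample, quote(sample(1:10, TRUE, size = 5)))
```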

How to write your own function

Each function in R has a name, arguments, an action (the body), and an output:

function_name <- function(arguments){
    action
    return(output)
}

Example: for given \(x\) and \(y\) compute \(x^2+y^2\).

sum_of_squares  <- function(x, y){
    result <- x^2 + y^2
    return(result)
}

You can make it shorter

For given \(x\) and \(y\) compute \(x^2+y^2\).

sum_of_squares  <- function(x, y){
    result <- x^2 + y^2
    return(result)
}

You can drop return.

sum_of_squares  <- function(x, y){
    result <- x^2 + y^2
    result
}

The value of the last expression evaluated is the output.

sum_of_squares <- function(x, y){
    x^2 + y^2
}
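
One way to confirm that the three versions behave the same is to define them under different names and compare the results. The names v1, v2, and v3 are made up for this check.

```r
v1 <- function(x, y){
    result <- x^2 + y^2
    return(result)      # explicit return()
}

v2 <- function(x, y){
    result <- x^2 + y^2
    result              # the last line is returned
}

v3 <- function(x, y){
    x^2 + y^2           # the value of the last expression is returned
}

v1(3, 4)   # 25
v2(3, 4)   # 25
v3(3, 4)   # 25
```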

To call a function

Set all the arguments to some values.

function_name(arguments = values)

For example,

sum_of_squares(x = 1, y = 1)
## [1] 2

If you know the order of the arguments in the function definition, you can drop the argument names; R then matches the values by position.

sum_of_squares(10, 10)
## [1] 200

Exercise 1

Write a function that computes \(x^y\) given \(x\) and \(y\).

power <- function(x, y){
    result <- x^y
    return(result)
}

Let’s try.

power(1, 1)
## [1] 1
power(10, 3)
## [1] 1000
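
With power() the order of the arguments matters, which makes it a good place to see the difference between matching by name and matching by position. A minimal sketch that repeats the function from the exercise in its shortest form:

```r
power <- function(x, y){
    x^y
}

power(2, 3)          # by position: x = 2, y = 3, gives 8
power(y = 3, x = 2)  # by name, so the order does not matter, gives 8
power(3, 2)          # by position: x = 3, y = 2, gives 9
```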

Input can be anything

For example, a vector.

sum_of_squares <- function(x){
    sq <- x^2
    result <- sum(sq)
    return(result)
}

Let’s test it.

x <- c(1, 2, 3)
sum_of_squares(x)
## [1] 14

Input can be anything

For example, a list.

avg_score <- function(df){
    s <- df$scores
    result <- mean(s)
    return(result)
}

Let’s test it.

class <- list(students = c("Mary", "Bob", "Elena"),
          scores = c(9, 9, 1))
class
## $students
## [1] "Mary"  "Bob"   "Elena"
## 
## $scores
## [1] 9 9 1
avg_score(class)
## [1] 6.333333
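
Because avg_score() only uses df$scores, it should work equally well on a data frame that has a scores column. The sketch below repeats the function so it can be run on its own; the data frame class_df is a made-up example reusing the values above.

```r
avg_score <- function(df){
    s <- df$scores
    mean(s)
}

# Same students and scores, stored as a data frame instead of a list
class_df <- data.frame(students = c("Mary", "Bob", "Elena"),
                       scores = c(9, 9, 1))

avg_score(class_df)
```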

Output can be anything

For example, a plot.

plot_cases <- function(df){
    df <- df %>% gather(`1999`, `2000`, key = "year", value = "cases")
    plot <- ggplot() +
        geom_line(data = df, mapping = aes(x = as.numeric(year), y = cases, col = country))
    return(plot)
}

Let’s check!

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
plot_cases(df)

You can return multiple values

Use list() for this inside the return().

Example: for each triple x, y, z return maximum, minimum and average.

max_min_avg <- function(x, y, z){
    return(list(max = max(x, y, z), min = min(x, y, z), mean = mean(c(x, y, z))))
}
max_min_avg(1,2,3)
## $max
## [1] 3
## 
## $min
## [1] 1
## 
## $mean
## [1] 2

You can store the result in a variable.

result <- max_min_avg(1,2,3)
result$max
## [1] 3
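
One thing to watch when averaging several separate values: mean() averages only its first argument, and further unnamed arguments are matched to its other parameters (trim and na.rm) rather than added to the data. Combining the values with c() first avoids this. A small check, which also shows two ways of extracting an element from a list:

```r
mean(1, 2, 3)      # only the first argument is averaged: returns 1
mean(c(1, 2, 3))   # all three values are averaged: returns 2

# Elements of a list can be extracted with $ or [[ ]]
result <- list(max = 3, min = 1, mean = mean(c(1, 2, 3)))
result$mean
result[["max"]]
```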

You can set some default values for the arguments

Example: for each triple x, y, z return maximum, minimum and average.

max_min_avg <- function(x, y = 0, z = 0){
    return(list(max = max(x, y, z), min = min(x, y, z), mean = mean(c(x, y, z))))
}

If you skip an argument it is set to the default value.

max_min_avg(1)
## $max
## [1] 1
## 
## $min
## [1] 0
## 
## $mean
## [1] 0.3333333
max_min_avg(1, 2)
## $max
## [1] 2
## 
## $min
## [1] 0
## 
## $mean
## [1] 1
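
Defaults can also be overridden selectively by naming the argument. The sketch below is a variant of the function that combines the three values with c() before averaging; the call sets z and leaves y at its default. The names values and res are illustrative.

```r
max_min_avg <- function(x, y = 0, z = 0){
    values <- c(x, y, z)
    list(max = max(values), min = min(values), mean = mean(values))
}

res <- max_min_avg(1, z = 5)   # y keeps its default value 0
res$max    # 5
res$min    # 0
res$mean   # 2
```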

Today’s dataset: Drought in California

Data source: United States Drought Monitor (USDM)

USDM: data download

USDM: data selection

The data in Excel









Optional material

USDM: data selection details

tidyr functions: gather and spread

gather: Used when some column names are not variables, but values of a variable

(Source: R for Data Science)

spread: Opposite of gather

(Source: R for Data Science)
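
A minimal round trip, assuming the tidyr and tibble packages are installed: gather() turns the wide cases table from earlier in the session into long form, and spread() turns it back. The names wide, long, and back are illustrative.

```r
library(tidyr)
library(tibble)

# Wide form: one column per year
wide <- tibble(
    country = c("Afghanistan", "Brazil", "China"),
    `1999` = c(745, 37737, 212258),
    `2000` = c(2666, 80488, 213766)
)

# Wide to long: year becomes a column
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")

# Long back to wide: values of year become column names again
back <- spread(long, key = year, value = cases)

names(back)
```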

tidyr functions: separate and unite

separate: Used to separate values in one column into multiple columns

(Source: R for Data Science)

unite: Opposite of separate

(Source: R for Data Science)
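
A small sketch of the separate() and unite() pair, assuming the tidyr and tibble packages are installed. The dates table is hypothetical data made up for this example, as are the names parts and back.

```r
library(tidyr)
library(tibble)

# Hypothetical data: one column holding year, month and day together
dates <- tibble(date = c("2021-10-05", "2021-10-07"))

# separate(): split the date column into three columns at each "-"
parts <- separate(dates, col = date, into = c("year", "month", "day"), sep = "-")

# unite(): paste the three columns back into one, using "-" as the separator
back <- unite(parts, col = "date", year, month, day, sep = "-")

names(parts)
identical(back$date, dates$date)
```

Note that separate() returns the new columns as character values unless asked to convert them.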